CI build-cache rollout (sccache Phase 2 + GGUF models) + CUDA fast-build, dep bumps, NativeServer scaffold #245
Merged
Add a 'Cross-repo scope' note to the CI build cache section explaining the sccache+Depot compiler cache benefits only this repo's native build, and link the workspace crossrepostatus.md non-parity entry. No build/CI behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Keep only the one-line 'jllama-only, it's the sole repo with a native build' fact and defer the full rationale (Maven repos, GitHub-hosted runners, inert DEPOT_TOKEN, badge) to workspace/crossrepostatus.md instead of duplicating it here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…ncached
build.sh uses sccache as the compiler launcher, so a present-but-crashing sccache (the static-musl panic seen inside the dockcross cross-compile containers) failed every compile and turned the whole build red. The inert-safe guard only covered sccache being absent, not present-but-crashing.
Add sccache_can_wrap_compiler(): probe-compile a trivial TU through sccache and only enable -DCMAKE_{C,CXX}_COMPILER_LAUNCHER=sccache when it succeeds. On any failure it logs the captured Rust panic backtrace (and the detached server's SCCACHE_ERROR_LOG when a job sets one) and builds WITHOUT the cache — a clean green -O3 build. Also make the fetched sccache version a SCCACHE_DL_VERSION knob (default bumped 0.8.2 -> 0.15.0, overridable per-job) and only run --show-stats when sccache was actually used.
Verified locally with fake sccache/cmake across every variant: no token, use_cache=false, crashing sccache, and working sccache all produce a green build.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
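The probe-and-fall-back behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual build.sh code: the function body, the `launcher_flags` helper, and the way the compiler is passed in are assumptions made for the example; only the name `sccache_can_wrap_compiler` and the two CMake launcher variables come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of a probe-compile health check (not the real build.sh).

# sccache_can_wrap_compiler SCCACHE_BIN CC
# Succeeds only if SCCACHE_BIN can wrap CC to compile a trivial translation unit.
sccache_can_wrap_compiler() {
  sccache_bin="$1"
  cc="$2"
  tmp="$(mktemp -d)" || return 1
  printf 'int main(void) { return 0; }\n' > "$tmp/probe.c"
  if "$sccache_bin" "$cc" -c "$tmp/probe.c" -o "$tmp/probe.o" 2> "$tmp/probe.log"; then
    rm -rf "$tmp"
    return 0
  fi
  # Surface whatever the wrapper printed (for example a Rust panic backtrace).
  echo "sccache probe failed; building without the cache. Captured output:" >&2
  cat "$tmp/probe.log" >&2
  rm -rf "$tmp"
  return 1
}

# launcher_flags SCCACHE_BIN CC (helper name is an assumption)
# Prints the CMake launcher flags only when the probe passes; prints nothing otherwise.
launcher_flags() {
  if sccache_can_wrap_compiler "$1" "$2"; then
    echo "-DCMAKE_C_COMPILER_LAUNCHER=$1 -DCMAKE_CXX_COMPILER_LAUNCHER=$1"
  fi
}
```

In the same spirit as the "fake sccache" verification mentioned above, the sketch can be exercised without a real sccache: passing `false` as the wrapper simulates a crashing binary (no flags emitted), and passing `true` simulates a healthy one.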
…b 1) First dockcross job re-enabled after the phase-2 revert, now safe behind the build.sh probe. Forwards the Depot cache env into the container via DOCKCROSS_ARGS and enables SCCACHE_LOG=debug + SCCACHE_ERROR_LOG + RUST_BACKTRACE=full so this run captures the in-container panic root cause if it recurs (the probe keeps the build green either way). The CUDA, aarch64, Android, OpenCL-Android and Windows jobs stay uncached until this one is verified green in CI — one job at a time. Document the staged rollout and the probe in CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Bump the SCCACHE_DL_VERSION default 0.15.0 -> 0.16.0 (released 2026-06-19, the current latest). The x86_64-unknown-linux-musl asset is confirmed published; the fetch stays fail-safe (a missing version just falls back to an uncached build) and the value is overridable per-job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
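A fail-safe, version-parameterised fetch along these lines might look like the sketch below. The helper names `sccache_asset_url` and `fetch_sccache` are hypothetical, and the URL layout is an assumption based on how the sccache project names its GitHub release assets; the real build.sh may fetch the binary differently.

```shell
#!/bin/sh
# Hypothetical sketch of a fail-safe sccache download (not the real build.sh).

# Prints the assumed release-asset URL for the configured version.
sccache_asset_url() {
  version="${SCCACHE_DL_VERSION:-0.16.0}"   # default taken from the commit message
  target="x86_64-unknown-linux-musl"
  echo "https://github.com/mozilla/sccache/releases/download/v${version}/sccache-v${version}-${target}.tar.gz"
}

# fetch_sccache DEST_DIR
# Returns non-zero on any failure so the caller can fall back to an uncached build.
fetch_sccache() {
  dest="$1"
  archive="$(mktemp)" || return 1
  if curl --fail --silent --show-error --location \
          --output "$archive" "$(sccache_asset_url)" \
     && mkdir -p "$dest" \
     && tar -xzf "$archive" -C "$dest" --strip-components=1; then
    rm -f "$archive"
    return 0
  fi
  rm -f "$archive"
  echo "sccache download failed; continuing with an uncached build" >&2
  return 1
}
```

A caller would treat a non-zero return as "no cache available" rather than as a build error, for example `fetch_sccache "$HOME/.sccache-bin" || USE_CACHE=false` (the destination path here is illustrative).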
Bump DEFAULT_DOCKCROSS_IMAGE in all five wrappers from 20260312/13-9b3357c to 20260515-5fd14ac, the newest dockcross release on Docker Hub (verified: a full tag scan shows nothing dated later than 2026-05-15 across the images, no 2026-06 build exists, and 'latest' points to the same digest). This is a tag-pin bump on line 3 (the operative pin), not a full update.sh docker regeneration (which needs Docker, unavailable here); the wrapper body is version-stable. It changes the toolchain for every cross-compiled native artifact, so each platform should be confirmed green in CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…(phase 2, job 2) manylinux2014 (job 1) verified green in PR #245: sccache v0.16.0 probe passed inside the container (devtoolset-10 gcc), cache ON over Depot WebDAV, cold run stored 275 objects. The v0.8.2 in-container panic does not occur on v0.16.0. Dropped job 1's first-run diagnostics (SCCACHE_LOG/SCCACHE_ERROR_LOG/RUST_BACKTRACE) to its steady-state env. Enable job 2: crosscompile-linux-x86_64-cuda (manylinux_2_28 + CUDA via build_cuda_linux.sh, which execs build.sh, so the same probe guards it). Diagnostics on for its first run on the manylinux_2_28 image. Only the gcc C/C++ TUs cache; nvcc .cu kernels are not wrapped. aarch64/android/opencl-android/Windows stay uncached until each is verified — one job at a time. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Both are the latest stable patch releases on Maven Central. NullAway runs at -Xep:NullAway:ERROR and was verified clean with 'mvn compile' in this repo; pitest-maven is a plugin-only patch bump. Part of the cross-repo dependency freshness sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
… dev knob googletest: bump the BUILD_TESTING-only FetchContent (used only by jllama_test's C++ unit tests, not the shipped library and not coupled to llama.cpp) from v1.15.2 to v1.17.0. There is no constraint behind the tag — it is just latest-stable; CLAUDE.md now says to bump it periodically. CUDA_FAST_BUILD: add an opt-in, default-OFF env knob to build_cuda_linux.sh that builds CUDA for a single architecture (default 'native', override CUDA_ARCH=<cc>) instead of the full release arch set, to speed up local iteration. Default + CI/release behaviour is unchanged (full arch set), so released jars keep full GPU coverage. nvcc .cu kernels are not sccache-cached (limited support), so fewer archs is the real CUDA build-time lever; rationale documented in CLAUDE.md and inline. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
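The opt-in knob described above could be wired roughly like this. It is only a sketch: the `cuda_arch_flag` helper is hypothetical, and passing the architecture through CMake's `CMAKE_CUDA_ARCHITECTURES` variable is an assumption about how build_cuda_linux.sh hands the choice to the build; the env names and the `native` default come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of the CUDA_FAST_BUILD knob (not the real build_cuda_linux.sh).

# Prints an extra CMake argument when the fast path is requested.
# With the knob off (the default) it prints nothing, so the project's
# full release architecture set stays in effect.
cuda_arch_flag() {
  if [ "${CUDA_FAST_BUILD:-0}" = "1" ]; then
    # Single architecture: 'native' unless CUDA_ARCH overrides it.
    echo "-DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH:-native}"
  fi
}
```

For local iteration this would be invoked as, say, `CUDA_FAST_BUILD=1 CUDA_ARCH=120 ./build_cuda_linux.sh`; leaving both variables unset reproduces the release behaviour.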
… re-downloads) Each Java-test job re-downloaded ~5 GB of GGUF models from HuggingFace every run. Add an actions/cache@v5 step (path models/, shared key gguf-models-v1) to all four Java-test jobs and guard every model curl with 'test -f models/$NAME ||' so a cache hit skips the download. GGUF files are platform-independent, so ubuntu + macOS share one ~5 GB entry (well under GitHub's free 10 GB/repo cache). Deliberately GitHub's free cache, NOT Depot: Depot Cache is usage-priced (GB-scale model blobs would raise the bill, unlike the tiny content-addressed sccache objects) and its general file cache only works on Depot-hosted runners. Bonus: cache hits also dodge HuggingFace 429s (the reason for the curl --retry flags). Bump the key suffix when the model set/URLs change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
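The download guard mentioned above follows a simple pattern, sketched here as a reusable function. The `fetch_model` name and the specific retry values are illustrative assumptions; the `test -f models/$NAME ||` guard and the use of `curl --retry` come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of a cache-aware model download step.

# fetch_model NAME URL
# Downloads into models/ only when the file is not already there,
# so a restored cache turns the step into a no-op.
fetch_model() {
  name="$1"
  url="$2"
  mkdir -p models
  test -f "models/$name" || \
    curl --fail --location --retry 5 --retry-delay 10 \
         --output "models/$name" "$url"
}
```

On a cache hit the file already exists, `test -f` succeeds, and curl is never invoked, which is also what avoids the HuggingFace rate limit on warm runs.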
Expand the inline comment on the model-cache step: it exists to avoid re-downloading ~5 GB of GGUF test models from HuggingFace every run (and to dodge HF rate-limits). It is always ON by design, with no on/off flag, unlike the sccache compiler cache, which the use_cache input / USE_CACHE env toggles. The comment also notes that the step uses GitHub's free cache, not Depot. Comment-only; no behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…r + WebUI) Minimal structural wiring for the planned native server: NativeServer sits next to OpenAiCompatServer (the Java server) as the entry point for the upstream native HTTP transport (server-http.cpp + cpp-httplib) already compiled into libjllama — the only component that can serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the upstream routes (server.cpp's registration) are wired to a JNI entry point; isRunning()/getHost()/getPort()/close() are model-free placeholders. The native methods + C++ implementation + lifecycle are a separate, detailed step. Adds a model-free smoke test (NativeServerSmokeTest, 3 tests). Verified locally: compile (Error Prone/NullAway/Checker), javadoc (failOnWarnings), SpotBugs Max/Low (0 bugs, @tostring clears IMC), ArchUnit (12/12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…nly on publish Invert the CUDA build-time/coverage trade-off in CI without risking the distributed jar. The crosscompile-linux-x86_64-cuda job now sets CUDA_FAST_BUILD=1 (single arch, CUDA_ARCH=90) for validation runs (PR/push/non-publish dispatch) to cut nvcc time, and CUDA_FAST_BUILD=0 (full arch set) only when publish_to_central is set. Because publish-snapshot/publish-release require publish_to_central, every artifact that reaches Maven Central is still built for every GPU generation — only non-distributed validation builds go fast. CI has no GPU so the fast path pins a fixed CUDA_ARCH (native would fail at configure); both vars are forwarded into the dockcross container via DOCKCROSS_ARGS -e. build_cuda_linux.sh's own default stays off, so local/manual builds remain release-safe unless you opt in. Docs updated in CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…build Change the fast-path CUDA_ARCH from 90 to 120 (the newest CUDA 13.2 compute capability, consumer Blackwell / RTX 50xx) per request. Only affects the fast single-arch validation build (PR/push); publish runs still build the full arch set. Bump as newer GPU generations ship. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
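The publish-gated selection from the two commits above can be summarised in a few lines of shell. This is a sketch under assumptions: the `select_cuda_env` helper and the `FAST_CUDA_ARCH` / `PUBLISH_TO_CENTRAL` variable names are hypothetical, and the real workflow may express the same choice in YAML expressions instead. The `-e` forwarding via DOCKCROSS_ARGS and the arch value 120 are taken from the commit messages.

```shell
#!/bin/sh
# Hypothetical sketch of the CI-side CUDA build-mode selection.

# select_cuda_env PUBLISH_TO_CENTRAL
# Prints the docker-style -e arguments to forward into the dockcross container.
select_cuda_env() {
  publish="${1:-false}"
  if [ "$publish" = "true" ]; then
    # Publishing: full arch set, so the distributed jar covers every GPU generation.
    echo "-e CUDA_FAST_BUILD=0"
  else
    # Validation only: one fixed arch ('native' would fail on GPU-less CI runners).
    echo "-e CUDA_FAST_BUILD=1 -e CUDA_ARCH=${FAST_CUDA_ARCH:-120}"
  fi
}

DOCKCROSS_ARGS="$(select_cuda_env "${PUBLISH_TO_CENTRAL:-false}")"
```

Defaulting the argument to `false` means an unset input takes the fast validation path, while anything other than an explicit `true` can never produce a publishable build mode by accident.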
Add the USE_CACHE / SCCACHE_WEBDAV_* / DOCKCROSS_ARGS env to crosscompile-linux-aarch64, crosscompile-android-aarch64, and crosscompile-android-aarch64-opencl (jobs 3-5). Jobs 1-2 were already enabled (manylinux2014 verified green, CUDA first run in progress). The build.sh probe-compile health-check makes it safe to enable all jobs simultaneously: any container where sccache crashes automatically falls back to an uncached green build, so there is no need to stage one job at a time anymore. build_opencl_android.sh previously called cmake directly; changed to exec build.sh (same pattern as build_cuda_linux.sh) so it inherits the sccache probe + Depot launcher + --show-stats without duplicating any download/probe logic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…ifact Record the investigation outcome for caching the two Windows native build jobs (the only remaining uncached native builds): - Root cause: the Visual Studio generator ignores CMAKE_<LANG>_COMPILER_LAUNCHER (and ggml's GGML_CCACHE RULE_LAUNCH_COMPILE), so sccache can only cache under Ninja/Makefiles. - Upstream evidence: llama.cpp b9682 builds windows-cpu + windows-cuda with Ninja Multi-Config (+ ccache); the VS generator is only used by legacy jobs. - Chosen path: don't flip the working build blindly. Validate Ninja Multi-Config in a separate build, or ship two Windows artifacts (Ninja + MSVC) in parallel so end users can test both before committing — Windows build runs twice during the transition. - Implementation notes captured (sccache+Depot backend, build.bat generator wiring, files to touch, bounded risk via the publish gate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5



Summary
Rolls the shared CI build cache out across (almost) the whole native build matrix and bundles several independent build/quality improvements. Net effect: warm CI builds drop from tens of minutes to a few minutes, test-model downloads stop hitting HuggingFace every run, and CUDA validation builds are no longer the ~70-minute long pole — while every distributed artifact stays bit-identical to a clean release build.
Headline results measured on this branch:
1. sccache / Depot compiler cache: Phase 2 complete (all dockcross jobs)
- Probe-compile health check (sccache_can_wrap_compiler in build.sh): compiles a trivial TU through sccache before enabling it as the launcher. A present-but-crashing sccache (the v0.8.2 in-container panic that stalled the first attempt) now falls back to a clean, uncached, green -O3 build instead of turning CI red, and logs the panic backtrace + detached-server log for diagnosis.
- The fetched sccache version is now the SCCACHE_DL_VERSION knob (the panic is gone on 0.16.0).
- crosscompile-linux-x86_64 (manylinux2014): verified green, 99.64% warm hits
- crosscompile-linux-x86_64-cuda: gcc C/C++ TUs cache (nvcc kernels can't; see §3)
- crosscompile-linux-aarch64
- crosscompile-android-aarch64
- crosscompile-android-aarch64-opencl: build_opencl_android.sh now execs build.sh, inheriting the probe + launcher (same pattern as build_cuda_linux.sh)
- Builds stay green without DEPOT_TOKEN (fork PRs) and with use_cache=false; the probe makes enabling all jobs at once safe.

2. GGUF test-model cache (GitHub actions/cache)
- Caches models/ (~5 GB) across the 4 Java-test jobs under one platform-independent key (gguf-models-v1), so CodeLlama/Qwen/SmolVLM/etc. GGUFs are downloaded only when the cache is cold. Every download step is guarded with test -f … || curl ….
- publish.yml + CLAUDE.md.

3. CUDA fast-build knob (CUDA_FAST_BUILD)
- nvcc compiles every .cu kernel once per GPU arch, which is the dominant cost of the ~70-min CUDA job (sccache can't cache nvcc kernels). The new opt-in CUDA_FAST_BUILD builds a single arch to cut that time.
- CI keys the choice on publish_to_central: fast single-arch for PR/push validation, full arch set whenever publishing to Central. Every artifact that reaches Central is the full set; only validation runs are fast.

4. Dependency / image bumps
- NullAway 0.13.6 → 0.13.7, pitest-maven 1.25.4 → 1.25.5 (same bump applied to the 3 sibling repos on matching branches).
- googletest v1.15.2 → v1.17.0 (test-only; documented that it tracks nothing and should be bumped periodically).
- dockcross image → 20260515-5fd14ac (latest).

5. NativeServer scaffold (server package)
- net.ladenthin.llama.server.NativeServer: the planned entry point for the native HTTP transport (server-http.cpp + cpp-httplib, already compiled into libjllama), the only path able to serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the native routes are wired to JNI (a separate, detailed step). Fixes the package/API shape so the real wiring lands cleanly.
- NativeServerSmokeTest: 3 model-free tests (construct, start() throws, close() no-op); no model / no libjllama required.
- … OpenAiCompatServer (today's runnable server) and the NativeServer scaffold.

Docs
- CLAUDE.md: Phase 2 rollout status, CUDA_FAST_BUILD policy, GGUF-cache rationale, googletest note, cross-repo scope pointer.
- TODO.md: Windows sccache item (needs Ninja Multi-Config per upstream; evaluate shipping Ninja + MSVC artifacts in parallel) + the deferred NativeServer native-route wiring.

CI status / blockers
- … java:S5786 public-test-visibility on the smoke test: cosmetic.
- … -O3 / full-arch release build; publishing remains gated behind publish_to_central.

Deferred to follow-up sessions (tracked in TODO.md)
- NativeServer full implementation: native startServer/stopServer JNI methods, route wiring, lifecycle/threading, WebUI serving.

🤖 Generated with Claude Code
https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5